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Abstract 

An important and yet difficult problem in fitting multivariate mixture models is deter- 
mining the mixture complexity. We develop theory and a unified framework for finding 
the nonparametric maximum likelihood estimator of a multivariate mixing distribution and 
consequently estimating the mixture complexity. Multivariate mixtures provide a flexible 
approach to fitting high-dimensional data while offering data reduction through the num- 
ber, location and shape of the component densities. The central principle of our method 
is to cast the mixture maximization problem in the concave optimization framework with 
finitely many linear inequality constraints and turn it into an unconstrained problem using 
a penalty function. We establish the existence of parameter estimators and prove the con- 
vergence properties of the proposed algorithms. The role of a "sieve parameter" in reducing 
the dimensionality of mixture models is demonstrated. We derive analytical machinery for 
building a collection of semiparametric mixture models, including the multivariate case, 
via the sieve parameter. The performance of the methods are shown with applications to 
several data sets including the cdcl5 cell-cycle yeast microarray data. 
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1 Introduction 



Multivariate mixture modeling is a bridge between clustering and nonparametric multi- 
variate density estimation. The estimated multivariate mixture model provides both an 
estimate of the density for the overall data and partitions the data into several compo- 
nents or clusters. Determining the mixture complexity is challenging even in one dimen- 
sion (Laird, 1978; Jewell, 1982; Titterington et al., 1985; Lesperance and Kalbfleisch, 1992; 
Roeder, 1994; Lindsay, 1995; McLachlan and Peel, 2001; Pilla and Loader, 2003; Scott, 
2004a; Pilla and Charnigo, 2005). A fundamental problem in high-dimensional modeling 
is determining the number of components and their centers. One popular model-free ap- 
proach to high-dimensional modeling is the K-means algorithm (see Hastie et al. (2001) and 
the references therein). Model-based techniques such as density estimation (Scott, 1992; 
James et al., 2001; Scott, 2004b) and multivariate mixture models provide a reliable and 
flexible approach to high-dimensional modeling while providing a data reduction through 
the number, location and shape of the component densities. In the context of mixture 
models, the problem becomes determining the number of mixture components and estimat- 
ing the corresponding location parameter vectors. Furthermore, mixture models provide 
much of the flexibility of the nonparametric approaches, while retaining many advantages 
of the parametric approaches (Laird, 1978; Roeder, 1992; Lesperance and Kalbfleisch, 1992; 
Lindsay, 1995; Charnigo and Pilla, 2005; Scott, 2004b). 

One of the main reasons for the popularity of model-free approaches such as the K-means 
algorithm for high-dimensional modeling is the lack of a unified and powerful technique 
for fitting multivariate mixtures. The focus of this article is to develop analytical ma- 
chinery for building a collection of semiparametric mixture models, including the multivari- 
ate case. The theory and methods developed in this article have applications to image 
analysis, high-dimensional clustering and data mining, to name a few. The practical ap- 
plications of semiparametric mixture models are broad and include case-control studies 
with errors- in- variables (Roeder et al., 1996), random effects models and empirical Bayes 
method (Lindsay, 1995). A natural outcome of applying a multivariate mixture model for 
high-dimensional clustering is that (1) each cluster is statistically represented by a para- 
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metric distribution; for instance, normal in continuous case and Poisson in discrete case and 
(2) it provides the proportion of observations in each cluster through the estimated mixture 
probability. Furthermore, statistical tests can be developed easily based on the parameter 
estimators of the multivariate mixture models to answer various scientific or biological ques- 
tions. Due to high levels of noise inherent in many of the massive data sets, including the 
microarray technology, it is highly desirable to carry out the high-dimensional data analysis 
within a statistical framework. 

Let m := |supp(Q)| be the size of the support set of a mixing measure Q; i.e., the mixture 
complexity. If Q is finitely supported, m is finite and otherwise it is infinite. The current 
standard approach for finding the maximum likelihood estimator (MLE) of Q, when m is 
known a priori or fixed, is the well-known EM algorithm developed in the seminal article 
by Dempster et al. (1977). 

In the absence of the knowledge of mixture complexity, it is instructive to start with an 
overparameterized mixture model and search over the whole continuous parameter space 
effectively to obtain a parsimonious mixture model. Overparameterization here refers to 
fitting a model with many components relative to the actual number in the nonparametric 
MLE (NPMLE) of Q; hence, there is a redundancy of components in the mixture model. 
Such a scheme would be robust to parameter starting values chosen for fitting the mixture 
model. To accomplish this, one requires a powerful mixture algorithm that pushes most of 
the mixture probabilities to zero, which is on the boundary of the parameter space. The 
focus of this article is to develop theory and create robust (to starting values) as well as 
powerful algorithms to address this problem. 

The popular and widely employed EM-based algorithms are particularly slow for fitting such 
an overparameterized mixture model and it is very difficult, if not impossible, to remove 
the unnecessary components; since the algorithm can never reach such a boundary point. 
Moreover, the EM algorithm is sensitive to parameter starting values and fails to converge in 
certain mixture problems. Figure 4 in Section 7.1 demonstrates this aspect of the algorithm. 
Other examples where the EM algorithm converges to saddle points or fails to converge are 
noted by McLachlan and Krishnan (1997). 
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1.1 Statistical Framework 



Let T := {fg(x) : 6 £ £1 C K p } be a family of probability density functions with respect 
to a cr-finite dominating measure fi for a p-dimensional random vector x € X C and 
a p-dimensional location parameter vector € C a measurable space. Assume that 
the component density fg(x) is bounded in for each x £ X. Let Q be the space of all 
probability measures on Q with the u-field generated by its Borel subsets. For a given 
Q £ Q, we assume that data vector x arises from the marginal density 



which is referred to as a mixture density. The mixture model (1) is also applicable to 
empirical Bayes estimation, where Q is an unknown prior distribution and the objective 
becomes estimation of the posterior distribution of without assuming a functional form 
for the prior distribution. 

The goal is to estimate the mixing measure Q by finding the probability measure Q £ Q that 
maximizes the nonparametric mixture loglikelihood log {L(Q)} = Y27=i {§q ( x »)i- ^ * s 
well known that finding the NPMLE of Q is computationally intensive (Lesperance and Kalbfleisch, 
1992; Roeder, 1992; Lindsay, 1995; Bickel et al., 1998; Susko et al., 1999). Although Q is 
an arbitrary probability measure, under mild conditions Lindsay (1983a, b) showed that 
finding the MLE involved a standard problem of convex optimization, that of maximizing 
a concave function over a convex set. One consequence of this is that, as long as 1{Q) 
is bounded, the MLE of Q is concentrated on a support of cardinality at most that of 
d — the number of distinct observed data vectors. This is a very useful, albeit surprising, 
result since a potentially difficult nonparametric estimation problem is reduced to that 
of a finite dimension; hence algorithms can be constructed to find the solution. Hence, 
we restrict the attention to discrete probability measures Q having p-dimensional sup- 
port vectors 9±, . . . ,G m collected in a matrix with a corresponding vector of masses 
or mixing probabilities denoted by 7r = (tti, . . . ,vr m ) T such that 7r is in the unit simplex 



We consider model fitting for both discrete and continuous data. Therefore, it is instructive 
to define ig := {/e(yi), • • • , fe(yd)} T to be the ci-dimensional vector of distinct likelihood 
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for x £ X 



(1) 



II := {-7T € !K m : ttj G [0, !],£ 




ITj = 1}. 



terms, where T denotes transpose, (yi,---,y<z) € y are the distinct observation vectors 
arising from the original data vectors (xi, . . . ,x n ) € X. Let rij be the number of times yj 
occurs in the sample of x vectors. For continuous data, n = d and for discrete data, often 
n > d. 

The observed data matrix of dimension (nxp) with row vectors yj (i = 1, . . . , n) is assumed 
to arise from the mixture density g Q (yj) = Y^=i 71 j foj (y«)i an d the discrete mixing measure 
can be represented as Q = ^2,- ttj g(0j), where g(6) is a discrete measure with mass one at 
€ 0. If m is fixed, the model g g (yj) will be referred to as the m-component mixture model, 
and one can always find the NPMLE of Q using m equal to d (Lindsay, 1983a, b). However, 
the actual number of distinct support vectors with positive mixture probability, referred to 
as active supports, can be as small as one. The mixture loglikelihood of Q becomes 

d 

l(Q) = logL(Q) = ^2 rii log {g Q (yj)} overall QeQ. (2) 

i=i 

The goal is to find Q € Q such that l(Q) = supg 66 l(Q). 

The biggest practical problem one faces in solving the loglikelihood equations in (2) is that 
the number of inequalities is equal to the number of elements of the parameter space f2. 
There are some important problems where this number is finite, although possibly very 
large, such as in target recognition, hyperspectral image analysis and positron emission 
tomography. For these problems discretization of the parameter space is directly relevant. 
In other problems, may be a continuous space; hence, one needs a machinery for approx- 
imating the parameter space to solve these equations (see Section 2.1). 

1.2 Main Results 

In this article, we develop a unified framework for finding the NPMLE of a multivariate 
mixing distribution Q and consequently for building a collection of semiparametric mixture 
models. The key ingredients for building these models are the "sieve parameter" controlling 
the dimensionality of the mixture problem (as shown in Figures 2 and 3) and the ability to 
fit overparameterized mixture models. This collection of models enable us to investigate the 
role of many overlapping densities, thereby creating an ideal situation for solving large-scale 
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practical problems. 

We create a powerful technique referred to as the "Penalized Dual method" and an efficient 
algorithm for fitting overparameterized mixture models. Consequently we have a method for 
estimating the mixture complexity. This algorithm is a step in the direction of developing 
a unified framework for building a collection of semiparametric mixture models. 

The underlying principle for our method is that the mixture loglikelihood forms a concave 
functional on the convex set of all probability distributions which implies that there exists a 
dual optimization problem [Section 5.3, Lindsay (1995)]. Lesperance and Kalbfleisch (1992) 
and Susko et al. (1999) exploited this to create an elegant algorithm for finding the NPMLE 
of Q in univariate mixtures. This research is in the same spirit but extends these ideas by 
introducing a "penalty term" . A fundamental feature of our approach is that it eliminates 
ad hoc procedures to estimate the penalty parameter. The dual problem has a statistical 
interpretation analogous to the least squares problem and the formulation is strikingly 
similar to the one that arises in empirical likelihood (Owen, 2001) framework (see Section 
3.1). 

Our main results are summarized as follows. 

1. In Section 2, we cast the mixture problem in the dual optimization framework. We first 
develop a machinery for approximating the continuous parameter space and next 
create an algorithm (based on the Penalized Dual method) for finding the maximum of 
l(Q)- Consequently, we propose an algorithm for estimating the mixture complexity. 
We establish the convergence of this algorithm to MLE in Section 5. 

2. In Section 3, we develop theory for the Penalized Dual method in solving the dual 
optimization problem while presenting the statistical interpretation for our framework. 
By exploiting the inherent advantage of the penalty formulation, we derive a technique 
for converting parameter estimators from the Penalized Dual problem into the mixture 
probability parameters. We show that the Penalized Dual estimators converge to the 
mixture probability estimators as the penalty is increased, with the correct limits. 

3. Section 4 establishes the existence of parameter estimators and derives convergence re- 
sults for the Penalized Dual algorithm, for fitting overparameterized mixture models. 



The Penalized Dual algorithm effectively yields an estimate for the mixture complex- 
ity. Our algorithm is based on a modification of the Newton-Raphson algorithm and 
therefore, it inherits its virtues while retaining the stability (i.e., monotonically in- 
creasing the likelihood) of EM-based algorithms. Empirical assessment of the faster 
rate of convergence of our stable and powerful algorithm compared to the EM algo- 
rithm will be demonstrated in Section 5. 

4. It is shown that the algorithm based on the Penalized Dual method is robust to 
choice of parameter starting values and achieves the global maximum. The dimension 
of the dual optimization problem is fixed at d, the number of distinct observed data 
vectors, whereas for the mixture problem it grows with the mixture complexity m. 
For discrete mixture problems, such as binomial or Poisson, d can be much smaller 
than n (see for example Section 7.1). In these cases, the Penalized Dual method has 
no dimensionality cost. 

5. In Section 6, we derive several important structural properties of multivariate normal 
mixtures in which Q is modeled nonparametrically in the presence of an unknown 
variance-covariance matrix S € S common to all m components, where S is a compact 
space. The power of our method rests in building a collection of semiparametric 
mixture models, including the multivariate case. We demonstrate the role of the 
sieve parameter in reducing the dimension of the mixture problem by creating novel 
graphical devices (Figures 2 and 3) referred to as Mixture Tree Plots. 

6. When the cardinality of the discrete parameter set (chosen for approximating the 
continuous parameter space O) is large, the EM algorithm for such a mixture problem 
fails to converge to the MLE, for all practical purposes, while the Penalized Dual 
algorithm converges. From a model selection point of view, the EM algorithm does 
not eliminate the redundant components while the Penalized Dual algorithm yields a 
parsimonious mixture model. Empirical evidence of this is shown in Figure 4, Section 
7.1. 

7. Section 7 illustrates the power of the proposed methods using several applications. 
We compare our method with the EM algorithm, due to lack of a unified and/or 
stable algorithms for fitting the overparameterized mixture problems and for building 
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semiparametric mixture models. For the univariate mixture case, we compared with 
Rotated EM algorithm (an accelerated version of the EM only applicable to the uni- 
variate mixtures) proposed by Pilla and Lindsay (2001). Our empirical investigation 
demonstrate the faster rate of convergence of our algorithm, compared with the EM 
algorithm. 

Section 8 presents the conclusions and the Appendix derives technical details. 
1.3 Relevant Literature 

Widely employed model-free methods for high-dimensional modeling include the K-means 
algorithm, hierarchical clustering and agglomerative and divisive algorithms (Hastie et al., 
2001). However, none of these techniques take advantage of the inherent statistical structure 
of the data. 

The existing model-based mixture algorithms include those for finding the NPMLE of Q 
(Lesperance and Kalbfleisch, 1992; Susko et al., 1999; Connolly et al., 2001). These algo- 
rithms are either not fast enough for high-dimensional modeling or not applicable for the 
following mixture problem: (1) the component densities are poorly separated and/or (2) 
many of the estimated mixture probabilities are on the boundary of the parameter space. 
In analyzing Sloan Digital Sky Survey data by fitting the multivariate normal mixtures, 
Connolly et al. (2001) noted that many existing techniques are not computationally ef- 
ficient and their mixture EM algorithm obtains an improvement of only three orders of 
magnitude. Therefore, developing a powerful method for fitting multivariate mixtures is 
desirable. 

Although there have been some promising developments on accelerating the EM algorithm 
(see McLachlan and Krishnan (1997) and the references therein), none of these methods ad- 
dress the overparameterized mixture problem described earlier. To overcome the above dif- 
ficulties, Pilla and Lindsay (1996, 2001) proposed alternative augmentation schemes based 
on the principles of the EM that provide a significantly improved convergence rate of the 
EM algorithm for a class of finite mixture models. At this time, it is not clear how to extend 
these methods to multivariate mixture models; however, they do provide an important class 
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for comparison with our algorithm in univariate mixture problems (comparisons are made 
in Section 7.1). Lindsay (1995, Section 6.3) discusses several algorithmic methods based 
on directional derivatives such as the vertex direction method and vertex exchange method 
to find the NPMLE of Q. These methods also require searching over a discrete parameter 
space and have certain computational disadvantages (Lesperance and Kalbfleisch, 1992). 

2 Mixture Maximum Likelihood Problems 

In this section we first formulate the mixture problem as a convex optimization problem 
and next create a framework for approximating the continuous parameter space. Lastly, 
we develop an algorithm for finding the NPMLE of Q. This algorithm forms the basis for 
building a collection of semiparametric mixture models developed in Section 6. 

2.1 Maximizing l(Q) via Approximating Q 

If the number of components in Q is fixed, but the location parameter vectors are unknown, 
then l(Q) can have several local maxima (Lesperance and Kalbfleisch, 1992; Lindsay, 1995; 
Pilla and Lindsay, 2001). Both the EM and the K-means algorithms can get trapped at 
a local maximum while requiring a priori knowledge of the mixture complexity m. To 
overcome this problem, researchers often randomly perturb the parameter starting values 
and recompute the local maxima (Hall and Zhou, 2003; Hunter, 2004). However, there is no 
theoretical justification to guarantee that the resulting solution reaches closer to the global 
maximum. 

In fact, random parameter starting values can fail in the mixture context for the following 
reasons. First, one requires an a priori knowledge of the number of components m. Second, 
there is a danger of choosing multiple starting values from one component while ignoring 
to choose any from other components. In such a case, the EM algorithm may not neces- 
sarily be able to locate the component from which no parameter values are selected. This 
problem becomes severe when components of unequal sizes are present; see Section 7.3 for 
an empirical investigation of this aspect. 
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To combat the difficulties with the parameter starting value problem and an a priori knowl- 
edge of m, we develop a technique in which we approximate Q by the set of all probability 
measures on a discrete parameter space of Q. 

Approximating Q: Approximate Q by Q m , where Q m is a set of discrete distributions generated 
by a finite subset of Vt. We set this finite subset to be ® m = (Gi, . . . , 9 m ). As m — > oo 
and @ m becomes dense in f2, the set Q m — > Q. In practice, a sufficiently large m is chosen 
such that Q m approximates Q well. Therefore, the cardinality of m , namely m, determines 
how close the MLE is to the global MLE over all measures on Q. In approximating Q, it 
is important to select a suitable © m while keeping computations manageable. This will be 
addressed in Section 6.3. 

In what follows, we distinguish between the three mixture problems. 
1. The fixed support mixture problem is equivalent to maximizing 



over the parameter space II while treating the support set @ m C £1 as fixed. This is 
the primal or mixture problem for which we define a "dual" in Section 3. Note that 
dim(7r) = (m — 1) and grows with the cardinality of & m , which is a major obstacle 
when dim(0 m ) is large. However, the dimension of our dual optimization problem is 
fixed at d, the number of distinct observed data vectors. 

2. We fix the number of components in the mixing distribution Q to be m but treat the 6 
parameter vectors as unknown for each component. Therefore, the continuous support 
mixture model problem becomes simultaneously estimating 7r and 6 parameter vectors 
by maximizing l(Q) over II x & m for a fixed m. 

3. In the absence of knowledge of mixture complexity m, maximizing the mixture loglike- 
lihood in (2) yields an NPMLE that is a discrete distribution on the parameter space 
with a random number of component densities (Lindsay, 1995; Pilla and Lindsay, 
2001). This will be referred to as the nonparametric mixture model. The goal in turn 
becomes finding the probability measure Q £ Q that maximizes (2). 



d 




(3) 



i=l 
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For discrete mixture problems, such as binomial or Poisson, often d <C n; therefore, the 
dual methods are able to reduce the dimension of the mixture problem. The effect of this 
dimensionality on the performance of the algorithms will be demonstrated in Section 7.1. 



2.2 Characterization of the NPMLE of Q 

Let r be a curve in $l d consisting of all vectors of the form {/e(yi), • • • , fe(yd)}, where 
6 fi. Under compactness of T, we can define the convex hull of T as Conv(r) = {g s : 
Q G G, Q has finite support}, where g Q = {g Q (yi ),..., g Q (yd)} T ■ The optimal vector 
gg = {gg (yi ),..., g~ (yd) } T £ Conv(r) and a corresponding maximizing measure Q can 
be characterized in terms of the gradient function as shown next. 

Definition 1 (Finite identifiability): For a given family J 7 , suppose that Qi,Q2 € G have 
finite support. Suppose that Qj G G yields the mixture density g Q .(y) for j = 1,2. If 
Sq x (y) = §q 2 (y) f° r all y € 3^ implies Q\ = Q2, then the corresponding collection of 
mixture densities is said to have the finite identifiability property. 

An important aspect of our technique is based on the following fundamental property. 
For the NPMLE Q, the ith fitted model gg(y^) is guaranteed to be unique (regardless of 
identifiability of the mixture density), and that one can determine these fitted values by 
solving for the residual uii, on a log-scale, defined as 

log (wi) := log (— j - log{g g (yi)} for y» G y, i = 1, . . . , d. (4) 

In ordinary parametric likelihood problems the solution is characterized by the likelihood 
equations. We extend these ideas to our problem to show that the fitted values and the 
corresponding mixing distribution Q G G can be further characterized in terms of a set of 
gradient equations. That is, Q is an NPMLE if and only if 

* (O) = sup V Q (0) < for Q G G, (5) 

where the gradient function, the directional derivative of the mixture loglikelihood in the 
direction of a component density, is defined as 

V Q (0):=Y,n l \^p\-l) for 6 G ® m . (6) 

i=l LSswiJ J 
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If a candidate maximizing measure Q violates the gradient inequality in (5) at some 6 £ m , 
then one is not at the maximum. In particular, one can increase the loglikelihood by placing 
some positive probability at £ m . 

2.3 Finding the Maximum of l(Q) via & m 

For a fixed m, mixture estimation is challenging due to the fact that 1{Q) is not concave 
and hence there are several local maxima (Lesperance and Kalbfleisch, 1992; Lindsay, 1995; 
McLachlan and Peel, 2001; Pilla and Lindsay, 2001). We create an algorithm that is robust 
to the choice of parameter starting values and reaches closer to the global maximum of /(Q). 

We find the NPMLE of Q adaptively as follows. 
Algorithm 1 [Finding the Maximum of l(Q)] 

1. Consider m C to be the support set of Q. Solve the fixed support mixture problem 
by maximizing (3) over the parameter space II on the support set m , while treating 
flee m as fixed. It is worth noting that the larger the cardinality of m , the higher 
the value of the loglikelihood at convergence. 

2. Apply the MLE 7v (with the corresponding fixed support set m C fi) obtained in 
Step 1, as parameter starting values for the continuous support mixture problem and 
maximize (2) over the product parameter space II x m . 

For Step 1, one requires a stable and powerful mixture algorithm and is derived in the next 
sections. In particular, Algorithm 2 presented in Section 4.2 can be employed in Step 1. 
The Step 2 may include estimation of other parameters in the model such as S £ S, in the 
multivariate normal mixtures context. 

For the continuous support mixture model, it will be shown in Section 7 that Algorithm 1 
reaches closer to the global maximum, if not to the global maximum. Our empirical evidence 
suggests that Algorithm 1 is superior to EM-type algorithms that start with random (or 
arbitrary) parameter values. 
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3 The Dual Optimization Problem: Properties of Estimators 



We now present the problem that is dual to the primal problem (3) considered by Lindsay 
(1983a) and develop theory for effectively solving it. The dual problem is to maximize 

d 

l{w) =^2 rii log (uti) (7) 
i=i 

subject to the constraints w = (wi, . . . , Wd) T £ and 

d 

X>i/e(y*)<i fOT »ea (8) 

i=l 

Let w € be the solution to the above dual (or concave) optimization problem. The 
solution satisfies the relationship (4), so that solving the dual problem for w is equivalent 
to finding the log-scale residuals. Hence, indirectly, via (4), we obtain the model fitted values 
g _ . A challenging step is that one must solve for the parameter estimates for the model from 
these fitted values. We create a method that exploits the particular choice of our penalty 
term. Note that the constraints are linear in the parameter vector w £ and that the 
number of free parameters equals d, while the number of constraints equals the cardinality 
of m . The dual optimization is with respect to w whose dimension equals d. This is 
especially advantageous with large data sets containing, say, thousands of observations (see 
Section 7.1). In Appendix A.l, we establish the relationship between the primal and dual 
problems at the solution. 

3.1 Statistical Interpretation of the Dual Problem 

The formulation in (7) and (8) is strikingly similar to the one that arises in empirical 
likelihood framework (Owen, 2001) in which the function Z(w) is maximized over a similar 
set of linear constraints. The empirical likelihood problem also has a dual problem, although 
it does not appear to be computationally useful. 

There is a natural interpretation of the dual problem that is analogous to the linear model 
framework. In an application of the least squares problem, one finds the fitted values y 
directly by projecting the data y £ y onto the model space X (i.e., P^y = y) or solves 
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for the residual e by projecting y onto the orthogonal complement of the model space 
(i.e., Pjiy = e, where _L denotes the orthogonal projection). In turn, we solve for the 
fitted values using y = (y — e). The primal and dual problems have the same relationship 
as the projection and complementary projection of linear models. The parallel with the 
linear model framework holds if we let the data y« equal log (rij/ro), the fitted model y~i 
equal log{gg(yj)} and the log-scale residual e, equal log (wi). This approach again falls 
very much into the spirit of the empirical likelihood, where (rii/n) is the NPMLE of the 
probability of observing yj G y. 

3.2 The Penalized Dual Method: Theory 

The goal in this section is to turn the constrained dual optimization problem defined in (7) 
and (8) into an unconstrained one using a "penalty function". This is referred to as the 
Penalized Dual method. Our method is in the spirit of the log-barrier method (Renegar, 
2001) for convex programming; however it differs in two important respects as will be shown. 

The Penalized Dual method maximizes 

d 

H 7 (w) = 1 £(^)log(w i )-V(w, 1 ) (9) 

i=l 

over w € where 7 is a tuning parameter and "P(w, 7) is a penalty function that ensures 
that the Penalized Dual solution does not violate the constraints; the dual solution always 
stays in the interior of the constraint set. One choice for the penalty function is 

1 m 7 

7>(w,7) = -£{p 9j -( w )} 7 for ^6© m and 7 eK + , (10) 
7 3=1 

where the penalty parameter 7 is some large power and the constraint function is defined as 

d 

iwi: iy,; -o. (11) 

i=i 

That is, the dual problem constraints have the form p g (w) < 1. We first show that by 
increasing 7, V(yv, 7) will eventually create an infinite penalty on any w € that violates 
the constraints and advances the solution towards the dual problem solution. 

The proofs for our technical results are derived in the Appendix. 
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Proposition 1 For a given € f2 and w € the term in the summand of the penalty 
function V(w, 7) satisfies: 

00 if p e (w) > 1, 

if < p g (w) < 1 

as 7 — > 00. When p e (w) > 1, the penalty function is increasing in 7 for 7 > {log p e (w)} _1 . 
If p e (w) < 1, the penalty function is decreasing in 7 for all 7 G K+. 

Two main elegant features of our penalty function are the following: (1) We can directly 
construct an estimator for ir parameters from the penalized dual solution. (2) It is simple 
to calculate the gradient function to assess the algorithmic convergence using the relation 
(20), defined in Section 3.4. 

It is common in the optimization literature to employ a "barrier function" to build the 

penalty. For example, the log-barrier function defined as 

m 

7\(w,7) := -7^1og{l -p g .(-w)} for j = l,...,m 

5=1 

approaches —00 as w € approaches the boundary of the feasible set from the interior 
(Roos et al., 1997; Renegar, 2001). The effect of the penalty can be diminished by making 7 
close to 0. Our focus here is on a soft penalty of the form (10) which is well behaved outside 
the feasible set; however, as will be shown, it does force the solution into the interior. 

The penalized problem is unconstrained; therefore, we can find the "Penalized Dual optimal 

rp 

estimator" denoted w 7 = (u7 i7 , . . . ,^,7) , given by (A. 6) in Appendix A. 2, by solving 

9^ W ) = ^-£K-( W )} ^ = fOT « = 1,-.* (12) 

* * i=i 

For 7 = 1, there exists an explicit solution to the above equation as 

-1 



7=1 = ^ |X>i(y*)[ for i = 1 ,---,d. (13) 



This is an initial interior point solution for the algorithm. On the other hand, the conven- 
tional log-barrier methods do not automatically produce a starting value for the "barrier 
parameter" . 
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3.3 Existence of Parameter Estimators 



We consider the following method to solve for the ir parameters from the dual problem 
solution by exploiting the penalized structure. 

Recovering the Primal Estimators: Using (4), the model fitted values are found from 
the penalized dual solution w. However, such a solution does not immediately provide an 
estimator for iv € II and the technique for obtaining it is derived next. 

1. Restrict attention to 9j (j = 1, . . . , m) in & m C £1, the support set of Q 6 Q, for 
which the constraints are tight to ensure Yli=i ^i,-y foj (y«) = 1- 

2. Solve for 7r using the linear equations YljLi % fOj (Yi) = ( n i/ n )A^vy f° r each i = 
l,...,d. 

The penalized dual residuals are used to obtain a natural estimator for the mixture or 
primal problem, denoted by 7r* = (v? lj7 , • • • j^i )7 ) T '■ The statistic, which is referred to as 
the Penalized Dual estimator is 



1 7 

(w 7 )| for j = l,...,m, (14) 



Hi = ' 
where 

d 

P B] (w 7 ) = Yl ®in fs s (yO- ( 15 ) 

1=1 

In Appendix A. 2, it is shown that the estimator Wi tJ , derived in (A. 6), can be approximated 
in terms of {p g (w 7 )} However, these latter quantities with the power (7 — 1) do not 

sum to one, and hence are turned into a candidate estimator via normalization: 

x {p 9j (w 7 )} (7 } / 
7Tj i7 = --— jy for j = l,..., m. (16) 

This candidate estimator is used to obtain 7r 7 with its elements having the power 7 using 
the following theorem. 

Theorem 2 (a) For a given 7 £ the Penalized Dual estimator 

T 



7T 7 



(w 7 )} 7 5 --- 5 {p em (w 7 )} 
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is one EM-step from the candidate estimator 7r' ; consequently, 7r* yields a higher like- 
lihood value, (b) The estimators are in the unit simplex II* = {v?* 7 € ffl" 1 : 7r* 7 £ 
[0,1], 5^j=i ^1-7 = ( c ) The Penalized Dual solution w 7 satisfies p e ,(w 7 ) < 1 (j = 
1, . . . , m) and hence the estimator 7r 7 remains in the feasible region defined by (8). 

All the proofs are relegated to Appendix A. 4. 

The estimator 7? * provides a direct way to obtain the primal estimator 7r from our penalized 
dual solution, avoiding the problems of selection and inversion. 

3.4 Properties of the Penalized Dual Estimators 

In this section, we derive several statistical properties of the estimators. First, we establish 
that tv* converges to the MLE n as the penalty parameter 7 increases (Theorem 3 below). 
Along the way, we establish several important properties of the primal-gradient function 
that are necessary for solving the primal-dual problem. 

Although 7?* represents an EM improvement over 7? 7 , the candidate estimator, it is easier 
to establish optimization results for the latter. The following theorem shows that for a 
sufficiently large penalty, the Penalized Dual estimator will be close to the primal estimator. 
Let Q\ be the mixing distribution at the 7? 7 solution. 

Theorem 3 As 7 — > 00, g_ t — > g. . Consequently, the candidate estimator 7r 7 converges 
to the MLE 7?, whenever the latter is unique. 

Our goal is to obtain the mixture estimation problem from the penalized dual one using 
7T*. Therefore, it is important to determine directly from the dual problem how accurate is 
the estimator 7?*. We derive the gradient function corresponding to the mixing distribution 
Q 7 to accomplish this. It is easier to calculate this for 7?^ knowing that 7r* can only be 
better. From the primal-gradient function in (6), the gradient function for the estimator 
Q\, becomes 

Pgt(e)=EnJ m , for ^ 0m . (17) 

i=i l£fc=i< 7 /fl k (yi) J 
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Theorem 4 The primal-gradient function at the candidate estimator 7? ' can be written as 

Pe 

2 ? 3t(°j) = — 1 for G ®m(j = (18) 

^ 7 P7 



where 

(7-1) 

P., ' 

fc=l 



C — 11 

1 = EK(^)} • (19) 



Theorem 4 expresses the gradient function in terms of the dual solution and leads to a 
simpler device for checking the accuracy of the estimators. 

Corollary 5 At the candidate estimator tt\ the primal-gradient function satisfies 

£>gt (Oj) < P 7 - 1 for 6,j e ® m (j = 1, • • • , m), (20) 
where the term on the right-hand-side does not depend on j. 

We have established that one can refine the NPMLE of Q to the required accuracy by 
increasing m and 7 appropriately. 

4 The Structure of the Penalized Dual Algorithm 

In this section, we first investigate the structure of the penalized dual problem viewed as 
a function of w and 7 and next present a strategy for their joint estimation. Next, we 
present the Penalized Dual algorithm to effectively search over the discretized (but large) 
parameter space @ m C O. Lastly, convergence properties of the algorithms are derived. 

We let z = log (w) to eliminate the constraint w £ 
Theorem 6 (a) The function 

d 1 m 

£(z, 7 ) = ^p) Zi -- EK-( Z )} 7 for zGK and ( 21 ) 

i=i n 7 j=i 

is strictly concave in (z, 7), where p g . (z) = ^ exp(zj) / e . (y^). For any z S 5? in the feasible 
region defined by (8), the function /C(z,7) is strictly increasing as a function of 7. (b) The 
function /C(z, 7) is bounded above and achieves its maximum at z = z and 7 = 00. 
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4.1 Automatic Selection of 7 

A fundamental aspect of our algorithm is that we can maximize /C(z, 7) simultaneously with 
respect to z E and 7 6 K+; different from the approach employed in the conventional 
log-barrier methods in which this was not possible. Therefore, we can select the penalty 
parameter 7 automatically. From Theorem 6, the global maximum over z and 7 is attained 
when z = z and 7 —* 00. 

Remark 1: It may seem paradoxical to treat 7 as an unknown parameter even though 
it has an optimum value of 00. When 7 is large, /C(z, 7) has very severe curvature at 
the constraint boundary. This limits the range of effectiveness of quadratic approximation 
methods. Therefore, one should start with a small value for 7 and increase it as the 
algorithm progresses through the parameter space. This could possibly be achieved in some 
other systematic fashion; however, our empirical investigations suggest that systematic 
methods were not as efficient as our approach. A possible explanation could be that our 
strategy takes the curvature of the function /C(z, 7) into account, in providing the relevant 
information for determining the increments for 7. 

4.2 Searching Effectively Over the Discretized Parameter Space 

An algorithm for efficiently searching over the large discretized parameter space & m C O 
(required for Step 1 of Algorithm 1 described in Section 2.1) is derived next. In effect, the 
following algorithm is used for fitting the fixed support mixture model. 

Algorithm 2 (The Penalized Dual Algorithm) 

1. Consider 7=1 and its corresponding explicit solution z^ 1 ) given in (13) as the starting 
solution for the algorithm. 

2. Maximize the concave function /C(z, 7) simultaneously with respect to z S !R and 
7 6 3?_)_ using a modified Newton-Raphson algorithm [constrained the step size to 
ensure monotonicity in JC(z, 7)] until the following convergence criterion is satisfied; 
namely, the L2-norm of the change in the value of K(z,j) is less than 10~ 6 . 
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3. Fix 7 at obtained in Step 2 and find z (*) = arg max ze sR /C (zjW) 



using the 



modified Newton- Raphson algorithm. The algorithm is considered to have converged 
to the maximum at step t, when the inequality based on the primal- gradient function 



is satisfied since it guarantees convergence to a similar accuracy in the loglikelihood. 

As described in Section 2, the supremum of the gradient function sup eg @ m T>q(6) provides 
an assessment of the progression to the maximum and hence the criterion in (22) has a solid 
theoretical justification. 

Remark 2: The Step 3 of the algorithm is necessary since after Step 2, the Primal Dual 
estimator 7?* obtained via (14) are often not sufficiently close to the primal estimator 7r. 
This is because the algorithm does not necessarily satisfy the condition {<9/C(z, 7)/Sz} = 
with sufficient accuracy. In our applications, however, the primal-gradient inequality (22) 
was always achieved at the tolerance of 0.005; in fact, often reached significantly greater 
accuracy in the Penalized Dual estimators. 

The Penalized Dual Algorithm with Inactive Constraints: In Algorithm 2, if an estimated 
mixture probability ttj (j = l,...,m) is zero, then the corresponding constraint in the 
dual problem is inactive. We can dynamically update the active constraints by removing 
the inactive ones while adding new ones, whenever the support set violated the gradient 
inequality. From the Penalized Dual estimator in (14), it follows that if ttj — > 0, then 
{p g (w 7 )} 7 — ► which occurs when p g . (w 7 ) — > or 7 — > 00. In the former, one can 
essentially remove the corresponding density fg. (y«). It will be shown in Section 7 that the 
above algorithm, denoted by PD IC , produced a further reduction in computational time. 

As a consequence of the concavity of /C(z,7) established in Theorem 6, the Hessian H for 
/C(z, 7) (derived in equation (A. 7) in Appendix A. 3) is always non-singular and the sequence 
obtained from the Penalized Dual algorithm (i.e., Algorithm 2) are well defined. Even for 
large-scale problems such as the yeast microarray data considered in Section 7.3 in which 
H is of dimension 697, our modified Newton-Raphson algorithm was stable and efficient. 




(22) 
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Theoretically, owing to Theorem 6, the Step 2 of the Penalized Dual algorithm produces a 
sequence {z^ k \ 7W } fc>1 such that the sequence of functions {lC(z^ , 7 (fc) )} fc M -> fc(z,oo) 
as A; — > 00. This effectively implies that the sequence |z( fc )} fc>1 — > z and the sequence 
IfcM ~~ * 00 as ^ ~~ * 00 • However, in practice, convergence of 7 is slow; therefore, we 
terminate the modified Newton-Raphson algorithm in Step 2 when 7W is sufficiently large 
and maximize K, (z, 7W) over z for a fixed 7^). 

In our experience, a direct maximization of the mixture loglikelihood l(ir) over II using a 
modified Newton-Raphson algorithm was unstable and failed to converge to the maximum 

7T. 

5 Convergence Properties of the Algorithms 

In this section, we establish the convergence properties, including the rate of convergence, 
of the algorithms. 

First, we consider the algorithm for fitting the continuous support mixture model; i.e., an 
algorithm employed in Step 2 of Algorithm 1. We prove that the sequence of estimates 
{/3^} s >i obtained from the Step 2 of Algorithm 1 converges to an MLE of (3 6 O, namely 
(3, for a given data y 6 y as s increases. For instance, in the multivariate normal mixture 
framework, (3 becomes (Q, S). Assume that the sequence of estimates {/3^} s >i monoton- 
ically increases the loglikelihood An algorithm is said to converge if j3* = \\m. s (3^ 

exists, for a parameter vector (3 € f2. 

Wu (1983) established that monotonicity of /(•) does not imply the convergence of the 
sequence to a stationary point; however, if the sequence {l((3^)} s >i is bounded above, 
then it does converge monotonically to a stationary point of l(f3). The convergence of 
(3^ to (3 implies the convergence of l((3^) to l{(3) according to the Theorem 5, under the 
regularity conditions, derived by Wu (1983). 

Owing to Theorem 6, the Algorithm 2 (or Step 1 of Algorithm 1) produces a sequence of 
estimates {7r^}t>i that is guaranteed to converge to the unique MLE 7v. Combined this 
result with Theorem 5 in Wu (1983) establishes the convergence of Algorithm 1 to (3. 
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5.1 Convergence Criteria 



For the applications and simulation experiment, we used the convergence criterion based 
on the gradient function for the Penalized Dual (PD) and discrete EM (i.e., for fitting the 
fixed support mixture model) algorithms. That is, the algorithm has converged to the MLE 
7? if the criterion in (22) is satisfied. For the rest of the article, we denote the discrete EM 
by D-EM algorithm. 

The D-EM algorithm is a sublinearly convergent algorithm (Pilla and Lindsay, 2001); there- 
fore, a conventional convergence criterion based on the loglikelihood change or changes in 
parameters, such as 

e W = \l^ t A-l(^ t - 1 ^\<T (23) 

for a given tolerance r can be very misleading in the sense that the actual distance to the 
final loglikelihood 

AW = |Z | (24) 

can be orders of magnitude different from r. That is, this criterion may be met even 
though the parameter values are far from the correct solution (Titterington et al., 1985; 
Pilla and Lindsay, 2001). However, such rules are widely employed and therefore we con- 
ducted an experiment to assess the two criteria on two data sets. 

The most important assessment of the convergence of an ML algorithm is the value of the 
loglikelihood, as it provides information about the accuracy of parameter estimators on a 
confidence interval scale. Therefore, loglikelihood-based criterion is a useful one to employ 
in assessing the convergence of an algorithm in finding the MLE of the parameters (Lindsay, 
1995; Pilla and Lindsay, 2001). 

Simulation Experimental Design: We consider the simulated data by generating a sample 
of size n = 270 from N P (Q,I) with p = 3, where N p (Q,I) represents a measure of a 
p-dimensional normal random variable with mean Q and an identity variance-covariance 
matrix. The true mixing measure for Q € Q, is chosen by selecting the coordinates of 
Oj ( j = 1, . . . , m) from the set {—5,0,5} in all possible combinations, with equal mass at 
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each support vector. This resulted in a total of m = 3 mixture components. 

Fisher Iris Data: We fit a mixture of multivariate normal distributions to Fisher iris data 
(Fisher, 1936). The data consists of n = 150 observations collected on flowers of three iris 
species (Setosa, Verginica and Versicolor). Each observation is a vector of p = 4 variables 
sepal length (yi), sepal width (y2), petal length (y^) and petal width (y4). 

Table 1: Effect of a convergence criterion on the final loglikelihood in fitting the fixed 
support mixture model. 



Experiment 


t 


I (ttW) 


* (QW) 


A« 


Simulated 


1067 


-2313.6826 


0.0830 


0.0536 


Fisher Iris 


460 


-376.9595 


3.0017 


0.0156 



For each of the data sets, we selected the observed data matrix y for m and also set X = S, 
the sample variance-covariance matrix. In order to assess the accuracy of the algorithms 
at a given step t, we found the final loglikelihood value l(jr) to a high degree of accuracy 
using the PD algorithm for a sufficiently large t. Next, we fit mixtures of multivariate 
normal distributions to the simulated and iris data sets via the D-EM algorithm using 
the convergence criterion (23) with r = 0.0001. The A" values, presented in Table 1, 
demonstrate that the convergence criterion (23) would result in substantially less than four 
decimal accuracy for the Fisher iris data. On the other hand, the criterion based on \I/(QW) 
in (22) guarantees the final accuracy. 

5.2 Empirical Assessment of Convergence Rate 

We empirically assess the rate of convergence of the PD algorithm relative to the D-EM 
algorithm by defining A^ in (24) as the residual of the loglikelihood at the tth. step. 

To be precise, for some tv^ £ II, let I 71 " } t>1 be a sequence in II generated by an 
algorithm (such as the PD and D-EM algorithms). The algorithm can be expressed as 
Tj-W g M. (ttC" 1 )) for t > 1, where the map M. : II — > 2 n is a point-to-set mapping. If ir^ 
converges to 7? and M(-) is continuous, then tv must satisfy tv € A4(tt). 
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Definition 2 (Asymptotic Convergence Rate): Assume that n = A4(tt) and that the sequence 
{■7r^}t>i is generated by the map M. such that lim^oo Under the regularity 

conditions given by Wu (1983), this implies that lim^oo I (#) = 1(tt). The asymptotic 
convergence rate of the loglikelihood sequence \l { 7V ^)} t>1 a t /(w) generated by an algorithm 
is defined as 




From the following lemma (Pilla and Lindsay, 2001), the smaller the r for any given log- 
likelihood sequence, the faster it is progressing towards the MLE. 

Lemma 7 If the sequence {l(^^)}t>i is converging linearly, then as t — ► oo, the slope 
of the curve obtained by plotting logjA^} against t converges to log(r), where r is the 
asymptotic rate of convergence of the loglikelihood sequence generated by an algorithm. 

In order to assess the rate of convergence of the PD, relative to the D-EM, algorithm, we 
consider the Fisher iris data considered earlier for fitting a collection of semiparametric 
mixture of multivariate normal distributions N P (Q, <5E), where p = 4, 5] £ S and 5 6 
(details in Section 6.2). As before, we selected the observed data y for m and set 
5] = S in fitting the PD and D-EM algorithms. Figure 1 demonstrates the behavior of the 
algorithms for 6 £ {5,2,1,0.5}. We used logarithmic scaling of the vertical axis since a 
linearly convergent algorithm will become linear on this scale as t — > oo. Note that for the 
fixed support mixture model, the D-EM algorithm is converging sublinearly whereas the 
PD is converging linearly to the MLE n ; a significant improvement in convergence rate. In 
fact, Pilla and Lindsay (2001) observed a similar behavior of sublinear convergence of the 
D-EM algorithm for a class of univariate finite mixture problems. 

6 Semiparametric Mixtures of Multivariate Normal Distri- 
butions 

The methodology developed in this article is applicable to a wide range of problems, includ- 
ing multivariate t mixtures. However, the particular interest here is in difficult problems 
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Figure 1: Plot of the log residual of the loglikelihood, logjA^} against CPU time for the 
Fisher iris data with four variables demonstrating the sublinear convergence of the discrete 
EM (dashed line) and the linear convergence of the Penalized Dual (solid line) algorithms. 

with multivariate normal mixtures due to its ubiquitous applications. In semiparametric 
mixture setting, the mixing distribution Q is modeled nonparametrically in the presence of 
an unknown S € S, the variance-covariance matrix common to all m components. 

6.1 Structural Properties 

Let g Q (yi; X) = Y^jLi 71 i f^j (y*> ^0 f° r y« ^ ^ be a finite mixture of p-dimensional normal 
distributions, where fj,j £ W C f2 is the mean vector of the jth component density f^. (y^; S) 
and 5] E S is common to all m components. Note that in the continuous case d = n. The 
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corresponding loglikelihood is expressed as 



/(Q;E) = ^log{g s ( yi ;5:)}. 



(25) 



In the univariate case, Charnigo and Pilla (2005) establish that for a general family of 
mixture models with a structural parameter f3 (e.g., a 2 in the normal case), the likelihood 
framework breaks down when joint estimation of m, Q and (3 is attempted: at best the joint 
estimator of m, Q and (5 is degenerate, and at worst it does not even exist. The ML fails 
in this setting since taking finite samples from continuous probability distributions yields 
discrete data sets. When models that closely mimic discrete probability distributions are 
available, as they are when there are no restrictions on Q and /3, the likelihood will favor 
such models. The NPMLE results of Lindsay (1995, Section 2.6) cannot be applied if S is 
unknown; however, the following result holds. 

We define the gradient function for the multivariate normal mixture distributions as 




(26) 



Next, we define an NPMLE of Q G Q for a fixed £ G S as 



2, 



arg max l(Q; £). 
Qeg 



Theorem 8 (Unique NPMLE of Q) Assume £ > is fixed. 
(1) Suppose Q s satisfies 



V d (/z; S) < for all /x G 



(27) 



then Q s is an NPMLE of Q G Q. 

(2) Let the set . . . , fi K } for some K < n be the solution set 



{ M :^ 2j >;£) = o}. 



If the vectors 



f M . (y; E) = (y i; E), . . . , / M . (y n ; E) } 



for j = 1,...,K 



are linearly independent, then Q s is the unique NPMLE of Q G Q. 
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For the multivariate normal mixture model with a common S, we restrict attention to 
finite discrete latent distributions Q, then the pair (Q, X) is identifiable (Lindsay, 1995). 
For a general family of univariate mixtures, Charnigo and Pilla (2005) establish that joint 
estimation of m, Q and j3 is a well-defined problem if Q is finitely supported. If Q is not 
finitely supported, then g s (y;S) need not determine Q and j3 uniquely. Hence, an ML 
approach to the joint estimation of m, Q and 5] fails. However, since we fix m and consider 
Q to be finitely supported, joint estimation of Q and XI is feasible. Therefore, we can apply 
Algorithm 1 described in Section 2.1 to jointly estimate Q and XI. 

The following theorem establishes that joint identifiability of (Q, S) fails if Q is not finitely 
supported. The proof follows from the univariate nesting structure result, under mild 
regularity conditions, given by Charnigo and Pilla (2005). 

Theorem 9 ( Multivariate Mixture Nesting Structure ) The class of multivariate nor- 
mal mixture distributions possesses the nesting structure. That is, for any y S, in the 
sense of Lowner ordering, 

{N P (Q, : Q G S} C {N P (Q, E) : Q G £}, 

where N p (Q, £) represents a measure of a p-dimensional normal random variable with mean 
Q and a variance-covariance matrix XI 6 S. 

6.2 Role of the Sieve Parameter in Building Semiparametric Mixture 
Models 

We investigate building the sieve of models N p (F, ST,), where S £5 and 5 G 5R + is a sieve 
parameter (similar to the smoothing parameter employed in density estimation). The sieve 
parameter controls the dimensionality of a mixture model as will be demonstrated later. 
We derive theory for building a collection of semiparametric mixture models, including the 
multivariate case. 

In order to create a general family of mixture models, we consider the class {N p (Q, 5 S) : 
Q G Q, £ G S, 5 G For 8\ > 6q, Theorem 9 implies that 

{N P (Q, 5 1 S):Q£G}C {N p (Q, 5 S) : Q G G}; 
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hence, the collection of models becomes richer as 5 — > (see Figures 2 and 3). Moreover, 
every p-dimensional distribution F can be obtained as the weak limit of N P (F, 5 X) as 5 — > 0. 
Therefore, we can approximate any distribution by choosing 5 small. As a consequence, 
the principle of maximum likelihood cannot be applied to select 5 in the model N P (Q, S X!) 
since the likelihood becomes unbounded as 5 — > 0. Charnigo and Pilla (2005) develop theory 
for the univariate mixtures and demonstrate the effect of small 5 for a general family of 
univariate mixtures which extends to the multivariate case considered here. 

We create a strategy for building a collection of models {(Qs, S'S): Q € Q,S G using 
the Penalized Dual algorithm. As 5 — > 0, the NPMLE Q$ converges in distribution to 
n" 1 X]i^(y«)> where i9(y) is a discrete measure concentrated at y. 

To demonstrate the effect of 5 on the mixture complexity, we create a collection of models 
for both the univariate and multivariate data. In the univariate case, we simply have a 
a parameter. The univariate application considers the galaxy data set [Table 1 of Roeder 
(1990)] of 82 observations of relative velocities for galaxies from six well separated conic 
sections of the Corona Borealis region. Scientific interest lies in identifying substructures 
in clusters of galaxies. Multimodality is evidence of voids and superclusters in the far 
universe. Roeder (1990) obtained a = 0.95 using least squares cross validation. We set /i. £ 
©m = {9, . . . , 35} with a grid size of 0.02 for building a collection of semiparametric mixture 
models using Algorithm 2. The plot of log (a) against the support set fx corresponding to 
the estimate 77 obtained using the PD algorithm (namely, Algorithm 2) is shown in Figure 
2. That is, at each fixed log(cr), the plot displays \x parameter values that have positive 
mixture probability. The figure demonstrates the effect of a on the mixture complexity m. 

Next, we consider the Fisher's iris data described earlier. Once again, we selected observed 
data y for & m and set 5] = S. We let 5 = {0.1, . . . , 5} for building a collection of semi- 
parametric multivariate mixture models using Algorithm 2. Figure 3 shows the effect of 
5 on the mixture model complexity when only the two variables, namely the petal length 
and petal width are considered. The galaxy and Fisher iris data sets demonstrate that the 
number of components is a consequence of the choice of a (or 5 as the case may be) rather 
than a pre-selected parameter. 
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Figure 2: A Mixture Tree for the galaxy data set. 
6.3 Selection of the Support Set of Q 

As described in Section 2.1, in approximating Q, the biggest challenge is in selecting a 
suitable m while keeping computations manageable. This is addressed in this section. 

In the absence of a prior knowledge of the mixture complexity, correct specification of m C 
f2, the support set of Q, is very important for the Step 1 of Algorithm 1 (or equivalently for 
Algorithm 2). As expected, the final loglikelihood depends on this choice. In this section, 
we illustrate through the simulated data described earlier how the observed data matrix y 
provides the best choice for approximating the continuous parameter space 0. In effect, we 
select {#i = yi,02 = y2, • • • , m = y^}. Note that choosing y for the discrete parameter 
space m clearly covers the region of likely support vectors for the normal means and has 
the advantage of adapting naturally in richness to the sample size of the problem. 

To assess the effectiveness of using y for m (which is approximating the continuous pa- 
rameter space Q) we consider the simulated data described in Section 5.1. The true mixing 
measure for Q chosen for the simulation experiment is denoted by "True Support" in Table 
2. The "Equi-Distant" set for m was constructed on a lattice by choosing the elements 
in 0j = (6ji,6j2, Ojs) for j = 1, . . . , 8 3 from the set {—7, —5, . . . , 5, 7} resulting in a total of 
m = 8 3 support vectors; this set also included all the true support vectors. Table 2 presents 
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Figure 3: (a) The scatter plot of the Fisher iris data for the two variables, namely the petal 
length and petal width, showing three main groups, (b) A Mixture Tree demonstrating the 
effect of the sieve parameter 5 on the numbe¥*of components. The three main branches in 
the tree correspond to the three main components in (a). 



results obtained using Algorithm 2 (i.e., fixed support mixture model of estimating 7r for a 
given m ) and Step 2 of Algorithm 1 (i.e., continuous support mixture model of estimating 
Q and XI for a fixed m). 

In general, y should be an effective choice for @ m given that equi-distant is still a subjective 
one in the absence of any knowledge about the length of the distance. From the theory 
presented in Section 2.1, as m — > oo, ® m — > Q. However, in practice, choosing m = n is 
effectively creating a dense set for m and in fact approximating Q very well. 

Table 2: Effect of ® m C f2, the support set for Q, on the estimated loglikelihood. We set 
S = S in finding 1(tt) using the Penalized Dual (PD) algorithm. The solution -rv obtained 
from the PD algorithm with the corresponding fixed support set and S are employed as 
parameter starting values for finding I (^Q$,5'£ S J for a fixed 5 = 0.2 using the continuous 
EM (C-EM) algorithm. 

®m *_g) l(Qs,6v) 

True Support -2181.9 -1936.9 
Equi-Distant -2182.8 -1901.7 
Observed Data -2178.6 -1876.0 



7 Applications and Simulation Experiment 

The applications in this section are used to investigate the roles of many overlapping com- 
ponents which create an ideal situation for solving the large-scale practical problems. 

We assess the performance of the algorithms in finding the NPMLE of Q and for fitting 
the collection of semiparametric mixture models with applications to several data sets. The 
data sets, the parameter estimates and the Matlab software for fitting mixtures are available 
from the first author. For the Step 1 of Algorithm 1, we set XI = S; however, we estimate 
it in Step 2. 
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7.1 Mortality Data 



Our first application illustrates the tremendous advantage of our method in reducing the 
dimension of discrete mixture problems. In these problems the magnitude of d, the number 
of distinct observed data points, could be much smaller than m, the cardinality of @ m . 

We consider the data on death rates which gives the number of death notices for women 
aged 80 and over, from the Times newspaper for each day in the three-year period 1910 to 
1912 (Titterington et al., 1985). For the later data sets, we chose the observed data matrix 
y as the support set m . However, for this application, we selected the support set to be 
©m = {0, + r], ... ,9 — rj, 9}, where r] G {1, 0.5, 0.1, 0.01}. In effect, the mixture complexity 
m G {10,20,100,1000}. It is worth noting that the dimension of the dual optimization 
problem is d (equals 10) whereas that of the mixture problem is (m — 1) which grows 
significantly with the cardinality of the set rj. 

We fit a mixture of Poisson distributions to the mortality data using the PD and D-EM 
algorithms. Table 3 presents the number of steps required for convergence [based on 
the criterion (22)] and "CPU Factor", the ratio of the CPU time required by the D-EM 
algorithm to that of the PD. This ratio indicates the factor by which the D-EM algorithm 
is accelerated. We also present the values of 1(tv), I(Qtj), and ty(Q^). Furthermore, 
we consider the effect of eliminating the inactive constraints in the PD algorithm, namely 
PD IC . The table demonstrates that the PD-based algorithms advance toward the maximum 
more rapidly than does the D-EM algorithm with gains increasing as m (equivalently, the 
number of parameters to estimate) increases. Thousand-fold improvements are obtained at 
r] = 0.01 for which the number of parameters to estimate is the largest. For comparison, 
we fit the same model with the Rotated EM (an accelerated version of the EM applicable 
only for univariate mixtures) developed by Pilla and Lindsay (2001) and obtained CPU 
factors for the PD, relative to the Rotated EM, as 1.4, 22, 43 and 28, respectively for 
rj G {1,0.5,0.1,0.01}. 

At rj = 0.01, the D-EM has retained 199 support points with non zero probability at 
convergence (obtaining a smaller loglikelihood value) whereas the PD has retained just 26 
support points and reached the MLE in a reasonable number of steps; a significant reduction 
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Table 3: Building a collection of finite mixture models |Qr?: f] G j using Algorithm 1 
for the mortality data. First step involved setting & m = {0, + 77, . . . , 9 — 77, 9} with rj £ 
{1, 0.5, 0.1, 0.01} and finding Z(7?) using the PD-based and discrete EM (D-EM) algorithms. 
The estimate 7v obtained from the PD algorithm with the corresponding fixed support set 
is used as parameter starting values for finding I {^Qn*j using the continuous EM (C-EM) 
algorithm. 







7 f-^\ 

1{TZ) 


A (t) 


w [Q y ') 




CPU 


A 1 * j 1 

Algorithm 


V 




x 10 3 


x 10 3 




Factor 


PD 


1 


-1990.0928 


0.0000 


0.2577 


25 


5 


p D IC 




-1990.0928 


0.0000 


0.2577 


25 


7 


D-EM 




-1990.0929 


0.0172 


4.9885 


1,238 


1 


C-EM 




-1989.9 






2,179 




PD 


0.5 


-1989.9941 


0.0000 


0.1881 


26 


120 


PD ic 




-1989.9941 


0.0000 


0.1881 


26 


142 


D-EM 




-1989.9949 


0.7136 


4.9997 


31,149 


1 


C-EM 




-1989.9 






2,360 




PD 


0.1 


-1989.9281 


0.0000 


0.2521 


25 


638 


PD ic 




-1989.9281 


0.0000 


0.2520 


25 


719 


EM 




-1989.9322 


4.0901 


5.0000 


108,312 


1 


C-EM 




-1989.9 






1,997 




PD 


0.01 


-1989.9272 


0.1108 


0.2270 


27 


943 


PD ic 




-1989.9272 


0.1108 


0.2269 


27 


1,192 


D-EM 




-1989.9319 


4.8230 


5.0000 


113,081 


1 


C-EM 




-1989.9 






1,924 





33 



in the mixture complexity. For this case of 5, most of the mixture probabilities are near 
zero; hence the algorithms must push the estimates to the boundary of the parameter 
space — a least favorable case for the D-EM algorithm. When the NPMLE has fewer than 
d support points (an overparameterized mixture problem), then the D-EM algorithm has 
great difficulty in allocating probability to the redundant support points. The behavior 
of the cumulative distribution function (CDF) of Q for the two algorithms at rj = 0.01 is 
shown in Figure 4. It is clear that the D-EM algorithm has an extremely small step size 
whereas the PD has a reasonably large step size. This is due to the fact that the D-EM 
has retained a significantly large number of components with small jumps — an artifact of 
its failure to converge to the MLE in finite number of steps. From a model selection point 
of view, the D-EM fails to eliminate the redundant components while the PD algorithm 
provides a parsimonious mixture fit. 

7.2 Simulation Experiment 

We consider the simulated data described in Section 5.1. The data were generated from the 
multivariate normal mixture densities N P (Q,I) with p = 3 and n = 270 (see Table 2) by 
selecting the true mixing measure for Q £ Q as the coordinates of 6j (j = 1, . . . , to) from 
the set {—5, 0, 5} in all possible combinations, with equal mass at each support vector. 

Following Section 6.2, we apply the Penalized Dual algorithm in the context of building a 
collection of semiparametric mixture models for selected values of the sieve parameter 5. 
This will yield estimators with both many and few active support vectors; thereby provid- 
ing a mechanism to demonstrate the superiority of our method over the D-EM algorithm 
across a range of applications. Both the PD and PD IC algorithms provide uniformly better 
performance, producing 6 to 40-fold improvement in CPU factor over the D-EM algorithm. 
As illustrated in Section 6.2, in fitting N P (Q, <5S), there is a trade-off between decrease in 
the sieve parameter 5 and the increase in mixture complexity to; by increasing 5, we obtain 
a reduction in the mixture complexity to. 

As discussed in Section 5.1, an important attribute of the convergence of an algorithm is 
the value of loglikelihood, as it indicates accuracy on a confidence interval scale. Therefore, 
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Figure 4: (a) Behavior of the cumulative distribution function (CDF) of Q for the fixed 
support mixture model obtained from the D-EM (black) and PD (red) algorithms at r\ = 
0.01 for the mortality data, (b) An enlarged view of graph (a) at the first jump; the D-EM 
algorithm has many small jumps and retained the redundant components. 
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Table 4: Building a collection of semiparametric mixture models |(^2<5> SYlj : 5 £ 5?+ j 
using Algorithm 1 for the simulated, Fisher iris and Yeast microarray data sets. First step 
involved choosing the observed data matrix y for @ m and setting S = S in finding 
using the PD and discrete EM (D-EM) algorithms. The estimate 7? obtained from the 
PD algorithm with the corresponding fixed support set and 5 S are employed as parameter 
starting values for finding I (q$,5'£) using the continuous EM (C-EM) algorithm. 
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in Table 4, we present the loglikelihood values obtained using various algorithms. The CPU 
factor for the PD algorithm over the EM algorithm ranged from ten to over forty-fold for 
5 6 {5,2,1,0.5,0.2}. As predicted by the theory, for the PD and D-EM algorithms, the 
final *(Q(*)) given by (22) does provide a guarantee on the level of algorithmic convergence 
AW. Indeed, m some cases the bound A® < ^(Q (t) ) was very conservative. Moreover, 
the convergence criteria for the PD algorithm described in Section 4.1 achieved the desired 
accuracy in A^; however, typically the PD algorithms terminated at a considerably higher 
accuracy than the D-EM algorithm. In order to measure this effect, we continued the D-EM 
algorithm to the same level of accuracy as that of the PD for 8 = 1. In this case, for the 
EM algorithm, = 9, 987 at convergence, resulting in a CPU factor of 60 instead of 24. 

7.3 Fisher Iris and Yeast Microarray Data Sets 

We fit a mixture of multivariate normal distributions to the Fisher iris data described earlier 
by finding the NPMLE of Q and by building a collection of semiparametric mixture models 
for selected values of 5; results are presented in Table 4. The performance of the PD and 
D-EM algorithms was similar to that of the simulated data. 

Instead of choosing the PD solution as the parameter starting values for the C-EM, we 
consider random values to demonstrate their effect on a given algorithm. It is important 
to recognize that the C-EM algorithm requires an a priori knowledge of m. We considered 
the Fisher iris data with 5 = 1 and randomly selected m = 15 data vectors from n = 150 as 
parameter starting values for the algorithm. In the ten runs of the C-EM algorithm with 
random starting values, l(Q, 5]) ranged from -151.99 to -179.13; all of which are sub-optimal 
modes due to the "poor choice" of starting values. Similar behavior of the C-EM algorithm, 
in reaching a sub-optimal solution, was observed by Pilla and Lindsay (2001) for the galaxy 
data. Without an a priori knowledge of m, choosing m can be quite a challenge for using 
the C-EM algorithm in large-scale practical problems. 

The main technology for conducting high-throughput experiments in functional genomics 
is the microarray — a technical approach for assaying the abundance of mRNA for several 
genes simultaneously (see Hastie et al. (2001) for literature). A gene expression data set 
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collects the expression values from a series of DNA microarray experiments with each col- 
umn representing an experiment. Analysis of the expression patterns obtained from large 
gene arrays reveal the existence of clusters of genes with similar expression patterns. It 
is common to write the gene expression data of n genes, each measured at k individual 
array experiments (e.g., single time points or conditions) as an n x k matrix. Holter et al. 
(2000) analyzed a subset of the original published yeast cdcl5 cell-cycle data which consist 
of n = 696 genes under p = 12 time points or conditions. An important scientific question is 
to find out which genes are most similar to each other, in terms of their expression profiles 
across samples. One way to organize gene expression data is to cluster genes on the basis 
of their expression patterns. One can think of the genes as points in which we want to 
cluster together in some fashion. 

We fit multivariate normal mixtures to the yeast microarray data by finding the NPMLE 
of Q. This is an example of high-dimensional modeling. We observed similar performance 
of the algorithms to the previous examples. Table 4 presents the loglikelihood values. Since 
each observation is a point in 3ft 12 , at S = 0.2, we obtain the empirical CDF as the MLE. 
That is, each observation is its own component; hence the solution is not interesting. 

8 Discussion 

In this article we developed a framework for approximating the continuous parameter space 
and created an algorithm (based on the Penalized Dual method) for finding the maximum 
of Z(Q); consequently an algorithm for estimating the mixture complexity. We established 
convergence properties of the proposed algorithm. By exploiting the inherent advantage of 
the penalty formulation, we derived a technique for converting the parameter estimators 
from the Penalized Dual problem into those for the mixture probability parameters. We 
established the existence of parameter estimators and derived convergence results for the 
Penalized Dual algorithm, for fitting overparameterized mixture models. It was shown 
empirically that the Penalized Dual algorithm has a faster rate of convergence, compared 
with the discrete EM algorithm for overparameterized mixture problems. 
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The algorithm based on the Penalized Dual method reaches closer to the global maximum 
and is robust to the choice of the support set & m C f2 (dimensionality of the problem). 
These are desirable features for (1) analyzing high-dimensional data, and (2) for building 
a collection of semiparametric mixture models. The dimension of the dual optimization 
problem is fixed at d, the number of distinct observed data vectors; whereas that of the 
discrete EM grows with the cardinality of & m . For discrete mixture problems, such as 
binomial or Poisson, often d«n; therefore, there is no dimensionality cost with the dual 
problem. When the cardinality of m is large, the discrete EM algorithm fails to converge 
to the MLE, for all practical purposes, in certain mixture problems. 

We derived several important structural properties of multivariate normal mixtures in which 
Q is modeled nonparametrically in the presence of an unknown variance-covariance matrix 
S £ S common to all m components. The role of the sieve parameter in reducing the 
dimension of the mixture problem was demonstrated by creating new graphical devices, 
namely the Mixture Tree plots. 

The proposed methods are very powerful in searching over the whole discretized parameter 
space and in yielding a parsimonious mixture model. The discrete EM algorithm can 
be very difficult, if not impossible, in yielding a parsimonious model in problems with 
hundreds or thousands of parameters. Such problems are becoming increasingly common 
due to the rapid explosion of high-throughput data in microarray data and data mining. The 
applications for the methods described in this article are rich. Multivariate normal mixtures 
arise in many different practical scenarios, including data mining, knowledge discovery, data 
compression, pattern recognition and pattern classification. 
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Appendix: Technical Derivations 



A.l Relation Between the Primal and Dual Problems 

We establish the relation between the primal and dual problems at the solution using the 
change of variable g Q (yi) = {rii/n){wi)~ l (i = 1, ...,d). As a first step, we prove the 
following claim. 

Claim. The maximization of the primal problem in (3) is equivalent to 

d 

min ^2 n i lo & {Sq (y»)} ( AA ) 
Sq i=i 

subject to g Q = {g 2 (yi), . . . ,g Q (y d )} T € and V Q {0 3 ) < for Gj 6 @ m (j = 1, . . . ,m), 
where T>Q(6j) is defined in (6). 

The gradient constraints T>Q(Oj) < can equivalently be expressed as 

E(-)^TT<1 ^ j = l,...,m. (A.2) 

Let <2* € Q be the solution to the primal problem in (3) and let Q € Q be any solution 
that satisfies constraints of the dual problem in (A.l). The equivalence between the primal 
problem in (3) and the dual problem in (A.l) follows by establishing that 

d d 

n i lo g {§q (y<)} ^ Ui log { g s* ( yi )}- ( A - 3 ) 

i=l i=l 

Since log (x + A) > logs + A/(x + A), where a; + A = g Q and x = g Qi , , the above inequality 
yields 

Ym 1 °g{g s (y l )} > lo g{g a *(y«)} + ^ n i ^ g a( y ^ gQ*(y»)i 

i=i i=i i=i g s lyiJ 

= E"<'°gfe a -(y.)}-E". { ^frf < A - 4 > 

The second term in the right-hand side of (A. 4) is less than zero since T>q(6j) < and 
hence the relation (A. 3) holds. Therefore, the claim is established. 

Define Wi = (nj/n){g Q (y^)} -1 so that the constraints in (A.2) become Yli w i fojiYi) — 1 
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for j = 1, . . . , m. From this definition of u>j, the dual problem in (A.l) can be expressed as 

( d d \ 



mm 

+ k 1=1 



^njog (-M - ^ n f logO 



i=l 



Equivalently, the problem is max £V raj log (wi) subject to w G 3?^. which is the dual opti- 

W 

mization problem in (7). 



A. 2 Derivation of the Penalized-Dual Estimator tt 



3,1 



First, from the primal-dual relationship, it follows that 



71; 



n 



{/*i(yO} 



for yi G y; i = 1, . . . ,d. 



(A.5) 



Second, the following fixed-point equation is obtained by solving (12), 

-l 

( 7 -l) 

[P., (w 7 ) j 

3=1 



Hi 

n 



for z = 1, . . . , d. 



(A.6) 



By comparing the right-hand sides of (A.5) and (A.6), it is clear that g~{yi) parallels the 
term ^Ap g . (w 7 )}^ 7-1 ' fg (yj) and that the latter expression resembles a mixture density 

3 j ^ 

with {p g (w 7 )} (T" 1 ) playing the role of 7fj. 



A. 3 Hessian Matrix of the Function K(z,j) 

Let F = (fg 1 , . . . ,fe m ) T be an (m x d) matrix where f 0j = {fe^yi), . . ■,fe J (yd)} T is the 
(i-dimensional vector. In the sequel, we denote a vector of ones by 1 (with dimension clear 
from the context) and the diagonal matrix with elements x by diag(x). Therefore, 

£(z, 7) = -n T • z - - 1 T • p 7 for zeS and 7 G R+, 

n 7 

where n = (m, . . . , ra^) T and the constraint vector 



P = {p 9i ( z )>--- >Pe m ( z )} 



with p g .(z) (sometimes written as pj for exposition) is as in (11) expressed in terms of 
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The Hessian matrix of /C(z, 7) has the following elements: 



0_ 

Oz 



£(z,7) 



n 



1? 



diag(w) 



P ( - 1} }, 



2 



dzd^^ = -diag(w.{F r .p(^)} 



-(7 - 1) diag(w) • F T diag {p (7_2) } F • diag(w) 



/C(z, 7 ) 



7 

1 

7 



log(Pi' 



g/C(z, 7 ) 
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m r 1 

.7=1 k ' 



■ log(Pi) - ^ 



where w € is expressed as {exp(zi), . . . , exp(zrf)} and 



d 2 



dz 



1 m 

K ^i) = --Efe) (7_1)1 °g^)- 

' A — 1 



i=i 



(A.7) 



A. 4 Proofs 

In the sequel, we write p g (w 7 ) = pij a for exposition. 

Proof of Theorem 2. From the fixed-point equation (A. 6), we have the EM solution 

feAyi) 



7T 



j, EM 



for 0j £ & m (j = 1, • • • 

1 (Yi) 



"7 J "3 



(A. 



which follows from (16), where p 7 is given in (19). From (A. 6), the last equation becomes 

©,7} (7_1) 

1=1 

This again simplifies to Pj.-y = {Pj',7} 7 due to the relationship in (15). Thus 

em = 7 = ^j. 7 an< ^ * ne P ro °f °f P ar t ( a ) follows. As a consequence of the EM result, 

the estimators are in the unit simplex II* as claimed in part (b). Proof of part (c) follows 
by using the first inequality in part (b) in conjunction with relation (14). These two imply 
that p g {wi n ) < 1 for all 6 £ m . Hence, our estimator is in the feasible region as claimed. 



Next, we need the following lemma to prove Theorem 3. 
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Lemma 10 As 7 — > 00, 



1. 



Proof. From the following Lyapunov's inequality (Lehmann, 1999), 



one can find a bound for p 7 . For m number of constraints, it follows that 

( m I 

< (E-fe} 



1 

(7-1) 



m f ■ 1. 



Equivalently, Sj{Pj,7} ^ 7_1 ^ — Hence as 7 — > 00, we obtain p" 1 — > 1. I 

Proof of Theorem 3. From Corollary 5, we have 

X>gt (0j) < p 7 - 1 for 0, e e m (j = 1, . . . , m), 
where Pgt (0j) and p 7 are defined in (17) and (19), respectively. 

From Lemma 10, we have p 7 1 — > 1 as 7 — > 00. Therefore, in the limit, the primal-gradient 
function satisfies the inequality 

lim V & (Oj) < for Gj € ® m (j = 1, . . . , m). 

The compactness of the parameter space II can in turn be used to establish the convergence 
of g_ f to the maximizing value g. . If the vector of masses it for g. are uniquely determined, 
then the masses must converge as well. This in turn implies that as 7 — ► 00, the mixing 
distribution Q 7 with 7?^ as the vector of masses is the NPMLE. Consequently, 7?j 7 — > 7?-,- as 
7 — > 00 for j = 1, . . . , m. I 

Proof of Theorem 4- From (17) and the relation 7?j 7 = {pj, 7 } ^ 7_1 - ) p 7 , it follows that 

p^ 1 /<?,■ (y0 



*>&('i)=£ 



J?.; 



i=l 



E fc {^7} (7_1) A*(y 



for 0j € m . 



From (A. 6), we have u5j i7 
simplifies to 



71, 



SfclPfc^}^ 7 ^ fe k (yi) an d hence the last equation 



1 £ ( y *) ~~ 1 for J G ® m 



2=1 
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The desired result follows from the definition of pj. 



Proof of Theorem 6. The proof of part (a) is a consequence of the negative definiteness of 
H and the properties of V~(w,^) given in Proposition 1. For part (b), let 



for any fixed 7 G 3?+. That is, z 7 is the maximizer of the IC(z, 7) for a fixed 7. From 
equation (21) for a given z 7 , it follows that 



where pj n is expressed in terms of z 7 . From part (c) of Theorem 2, we have {Fj',7} 7 — 1 f° r 
j = 1, . . . , m. Therefore, log (pj, 7 ) < and hence dJC(z^,^)/d^ > for any finite 7 G 9? + 
and fixed z 7 . That is, for a fixed z G the only point at which the function /C(z,7) can 



Proof of Theorem 8. For exposition we drop the subscript 5] from Q* and Q; however it is 
understood that the mixing measures have a dependence on the fixed S. 

Part (1): We first establish that Q is an NPMLE. We start with creating a path in Q from 
Q to Q*, by letting Q a = (1 - a) Q + a Q* for a G [0, 1],Q* eg and Q satisfying the 
relation (27). Note that Q a G Q; therefore, Q is a convex set. The loglikelihood along this 
path satisfies 



for a G [0, 1] and for a fixed X. Therefore, l(Q a ; S) is a concave function for a fixed S. The 
directional derivative of l(Q a ', S) at g^ toward g Q * can be expressed, after simplification, 
as 



From (27), it follows that (A.9) is < for all Q* G Q and Q satisfying (27). This result 
combined with the concavity of Z(Q Q ; S) implies that Q is an NPMLE of Q G g. 



z 7 = arg max /C(z, 7) 




approach its supremum is at 7 = 00. 





(A.9) 



44 



Part (2): We establish the uniqueness of the NPMLE. Suppose Q and Q* are two NPMLEs 
of Q £ G, then 

I (Qa'i E) = (1 - a) I (Q; E) + al (Q*; E) 



for all a € [0, 1] and for a fixed E. This implies that the derivative dl(Q a ; E)/da at a = 
is exactly zero. This implies that Q* (and Q) is supported on . . . [i- e -> the set of 

zeroes of T>q(/j,; S)]. Furthermore, the second derivative 

r/ 2 



da 2 

which implies that 



z(Qa;£; 



= o, 

a=0 



^ {g Qi (y i ;S)-g g (y i ;5])} 2 
i=i {ag ei (y t ;S) + (l-a)g g ( yi ;S)} 2 



That is, 



g s *(yi; s ) = g g (yi;E) for all i = l,...,n. 

However, the linear independence of the vectors (y; S) for j = 1, . . . , K implies that 
{iri, . . . , ttk} = {vr^, . . . , vr^}. That is, Q* = Q; establishing the uniqueness of the NPMLE 
of Q for a fixed S. ■ 
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